Abstract: This paper is aimed at designing efficient parallel matrix-product algorithms
for heterogeneous master-worker platforms. While matrix-product is well-understood for
homogeneous 2D-arrays of processors (e.g., Cannon's algorithm and the ScaLAPACK outer
product algorithm), there are three key hypotheses that render our work original and innovative:

- Centralized data. We assume that all matrix files originate from, and must be returned 
to, the master. The master distributes both data and computations to the workers (while 
in ScaLAPACK, input and output matrices are initially distributed among participating 
resources). Typically, our approach is useful in the context of speeding up MATLAB or 
SCILAB clients running on a server (which acts as the master and initial repository of files). 

- Heterogeneous star-shaped platforms. We target fully heterogeneous platforms, where
computational resources have different computing powers. Also, the workers are connected to
the master by links of different capacities. This framework is realistic when deploying the
application from the server, which is responsible for enrolling authorized resources.

- Limited memory. Because we investigate the parallelization of large problems, we cannot 
assume that full matrix panels can be stored in the worker memories and re-used for sub- 
sequent updates (as in ScaLAPACK). The amount of memory available in each worker is 
expressed as a given number of buffers, where a buffer can store a square block of matrix 
elements. The size q of these square blocks is chosen so as to harness the power of Level 3 
BLAS routines: q = 80 or 100 on most platforms. 

We have devised efficient algorithms for resource selection (deciding which workers to 
enroll) and communication ordering (both for input and result messages), and we report a 
set of numerical experiments on various platforms at École Normale Supérieure de Lyon and
the University of Tennessee. However, we point out that in this first version of the report,
experiments are limited to homogeneous platforms.

This text is also available as a research report of the Laboratoire de l'Informatique du Parallélisme.

Key-words: Matrix product, LU decomposition, Master-worker platform, Heterogeneous
platforms, Scheduling
Matrix product on master-worker platforms

Summary: This paper aims at defining efficient algorithms for parallel matrix product
on heterogeneous master-worker platforms. Although matrix product is well understood
for two-dimensional grids of homogeneous processors (cf. Cannon's algorithm and the
ScaLAPACK outer product), three hypotheses make our work original:

- Centralized data. We assume that all matrices initially reside on the master and must
be returned to it. The master distributes data and computations to the workers (whereas
in ScaLAPACK, the input and result matrices are initially distributed among the
participating processors). Typically, our approach is justified in the context of
speeding up MATLAB or SCILAB clients running on a server (which acts as the master
and initially holds the data).

- Heterogeneous star-shaped platforms. We are interested in fully heterogeneous platforms
whose computing resources have different computing powers and whose workers are
connected to the master by links of different capacities. This framework is realistic when
the application is deployed from the server, which is responsible for enrolling the
necessary resources.

- Bounded memory. Because we are interested in the parallelization of large problems,
we cannot assume that all submatrices can be stored in the memory of each worker to be
possibly re-used later (as is the case in ScaLAPACK). The amount of memory available on
a given worker is expressed as a number of buffers, where a buffer can hold exactly one
square block of matrix elements. The size q of these square blocks is chosen so as to
exploit the power of Level 3 BLAS routines: q = 80 or 100 on most platforms.

We have defined efficient algorithms for resource selection (deciding which worker(s)
to use) and communication scheduling (sending data and retrieving results), and we
report a set of experiments on platforms at the École normale supérieure de Lyon and at
the University of Tennessee. We point out, however, that in the first version of this
report the experiments only concern homogeneous platforms.

Keywords: Matrix product, LU decomposition, Master-worker platforms,
Heterogeneous platforms, Scheduling
1 Introduction

Matrix product is a key computational kernel in many scientific applications, and it has 
been extensively studied on parallel architectures. Two well-known parallel versions are 
Cannon's algorithm [14] and the ScaLAPACK outer product algorithm [13]. Typically, 
parallel implementations work well on 2D processor grids, because the input matrices are 
sliced horizontally and vertically into square blocks that are mapped one-to-one onto the 
physical resources; several communications can take place in parallel, both horizontally and 
vertically. Even better, most of these communications can be overlapped with (independent) 
computations. All these characteristics render the matrix product kernel quite amenable to 
an efficient parallel implementation on 2D processor grids.

However, current architectures typically take the form of heterogeneous clusters, which 
are composed of heterogeneous computing resources, interconnected by a sparse network: 
there are no direct links between any pair of processors. Instead, messages from one processor 
to another are routed via several links, likely to have different capacities. Worse, congestion 
will occur when two messages, involving two different sender/receiver pairs, collide because 
the same physical link happens to belong to the two routing paths. Therefore, an accurate
estimation of the communication cost requires a precise knowledge of the underlying target 
platform. In addition, it becomes necessary to include the cost of both the initial distribution 
of the matrices to the processors and of collecting back the results. These input/output 
operations have always been neglected in the analysis of the conventional algorithms. This 
is because only $O(n^2)$ coefficients need to be distributed in the beginning, and gathered at
the end, as opposed to the $O(n^3)$ computations to be performed (where n is the problem
size). The assumption that these communications can be ignored could have made sense 
on dedicated processor grids like, say, the Intel Paragon, but it is no longer reasonable on 
heterogeneous platforms. 

There are two possible approaches to tackle the parallelization of matrix product on 
heterogeneous clusters when aiming at reusing the 2D processor grid strategy. The first 
(drastic) approach is to ignore communications. The objective is then to load-balance com- 
putations as evenly as possible on a heterogeneous 2D processor grid. This corresponds to 
arranging the n available resources as a (virtual) 2D grid of size $p \times q$ (where $p \cdot q \le n$) so
that each processor receives a share of the work, i.e., a rectangle, whose area is proportional 
to its relative computing speed. There are many processor arrangements to consider, and 
determining the optimal one is a highly combinatorial problem, which has been proven NP- 
complete in [5]. In fact, because of the geometric constraints imposed by the 2D processor 
grid, a perfect load-balancing can only be achieved in some very particular cases. 

The second approach is to relax the geometric constraints imposed by a 2D processor 
grid. The idea is then to search for a 2D partitioning of the input matrices into rectangles 
that will be mapped one-to-one onto the processors. Because the 2D partitioning now is 
irregular (it is no longer constrained to a 2D grid), some processors may well have more than
four neighbors. The advantage of this approach is that a perfect load-balancing is always
possible; for instance partitioning the matrices into horizontal slices whose vertical
dimension is proportional to the computing speed of the processors always leads to a perfectly
balanced distribution of the computations. The objective is then to minimize the total cost
of the communications. However, it is very hard to accurately predict this cost. Indeed, 
the processor arrangement is virtual, not physical: as explained above, the underlying inter- 
connection network is not expected to be a complete graph, and communications between 
neighbor processors in the arrangement are likely to be realized via several physical links 
constituting the communication path. The actual distribution of the physical links across all
paths is hard to predict, but contention is almost certain to occur. This is why a natural,
although pessimistic, way to estimate the communication cost is to assume that all
communications in the execution of the algorithm will be implemented sequentially. With
this hypothesis, minimizing the total communication cost amounts to minimizing the
total communication volume. Unfortunately, this problem has been shown to be NP-complete as
well [6]. Note that even under the optimistic assumption that all communications at a given
step of the algorithm can take place in parallel, the problem remains NP-complete [7].

In this paper, we do not try to adapt the 2D processor grid strategy to heterogeneous 
clusters. Instead, we adopt a realistic application scenario, where input files are read from 
a fixed repository (disk on a data server). Computations will be delegated to available 
resources in the target architecture, and results will be returned to the repository. This 
calls for a master-worker paradigm, or more precisely for a computational scheme where the 
master (the processor holding the input data) assigns computations to other resources, the 
workers. In this centralized approach, all matrix files originate from, and must be returned 
to, the master. The master distributes both data and computations to the workers (while 
in ScaLAPACK, input and output matrices are supposed to be equally distributed among 
participating resources beforehand). Typically, our approach is useful in the context of 
speeding up MATLAB or SCILAB clients running on a server (which acts as the master and 
initial repository of files). 

We target fully heterogeneous master-worker platforms, where computational resources 
have different computing powers. Also, the workers are connected to the master by links 
of different capacities. This framework is realistic when deploying the application from the 
server, which is responsible for enrolling authorized resources. 

Finally, because we investigate the parallelization of large problems, we cannot assume 
that full matrix panels can be stored in worker memories and re-used for subsequent updates
(as in ScaLAPACK). The amount of memory available in each worker is expressed as a given
number of buffers, where a buffer can store a square block of matrix elements. The size q
of these square blocks is chosen so as to harness the power of Level 3 BLAS routines: q = 80
or 100 on most platforms.

To summarize, the target platform is composed of several workers with different comput- 
ing powers, different bandwidth links to/from the master, and different, limited, memory 
capacities. The first problem is resource selection. Which workers should be enrolled in 
the execution? All of them, or maybe only the faster computing ones, or else only the 
faster-communicating ones? Once participating resources have been selected, several
scheduling decisions remain: How can the number of communications be minimized? In
which order should workers receive input data and return results? What amount of
communications can be overlapped with (independent) computations? The goal of this paper is to
design efficient algorithms for resource selection and communication ordering. In addition,
we report numerical experiments on various heterogeneous platforms at the École Normale
Supérieure de Lyon and at the University of Tennessee.

The rest of the paper is organized as follows. In Section 2, we state the scheduling 
problem precisely, and we introduce some notations. In Section 3, we start with a theoretical 
study of the simplest version of the problem, without memory limitation, which is intended 
to show the intrinsic difficulty of the scheduling problem. Next, in Section 4, we proceed 
with the analysis of the total communication volume that is needed in the presence of 
memory constraints, and we improve a well-known bound by Toledo [38, 27]. We deal 
with homogeneous platforms in Section 5, and we propose a scheduling algorithm that 
includes resource selection. Section 6 is the counterpart for heterogeneous platforms, but 
the algorithms are much more complicated. In Section 7, we briefly discuss how to extend 
previous approaches to LU factorization. We report several MPI experiments in Section 8. 
Section 9 is devoted to an overview of related work. Finally, we state some concluding 
remarks in Section 10. 



2 Framework 

In this section we formally state our hypotheses on the application (Section 2.1) and on the
target platform (Section 2.2).

Figure 1: Partition of the three matrices A, B, and C.

Figure 2: A fully heterogeneous master-worker platform.

2.1 Application

We deal with the computational kernel $C \leftarrow C + A \times B$. We partition the three matrices A,
B, and C as illustrated in Figure 1. More precisely: 

• We use a block-oriented approach. The atomic elements that we manipulate are not
matrix coefficients but instead square blocks of size $q \times q$ (hence with $q^2$ coefficients).
This is to harness the power of Level 3 BLAS routines [12]. Typically, q = 80 or 100
when using ATLAS-generated routines [40].

• The input matrix A is of size $n_A \times n_{AB}$:

- we split A into r horizontal stripes $A_i$, $1 \le i \le r$, where $r = n_A/q$;

- we split each stripe $A_i$ into t square $q \times q$ blocks $A_{i,k}$, $1 \le k \le t$, where $t = n_{AB}/q$.

• The input matrix B is of size $n_{AB} \times n_B$:

- we split B into s vertical stripes $B_j$, $1 \le j \le s$, where $s = n_B/q$;

- we split stripe $B_j$ into t square $q \times q$ blocks $B_{k,j}$, $1 \le k \le t$.

• We compute $C \leftarrow C + A \times B$. Matrix C is accessed (both for input and output) by
square $q \times q$ blocks $C_{i,j}$, $1 \le i \le r$, $1 \le j \le s$. There are $r \times s$ such blocks.

We point out that with such a decomposition all stripes and blocks have the same size. This
will greatly simplify the analysis of communication costs. 
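
As a quick illustration of this decomposition, the following Python sketch computes r, s, and t from the matrix dimensions and the block size. The function name `block_partition` and the requirement that every dimension be an exact multiple of q are assumptions made for the example, not part of the paper.

```python
def block_partition(n_a: int, n_ab: int, n_b: int, q: int) -> dict:
    """Block counts for C <- C + A x B, with A of size n_a x n_ab and B of size n_ab x n_b.

    Assumption for this sketch: every dimension is an exact multiple of q.
    """
    for name, n in (("n_a", n_a), ("n_ab", n_ab), ("n_b", n_b)):
        if n % q != 0:
            raise ValueError(f"{name}={n} is not a multiple of q={q}")
    r, s, t = n_a // q, n_b // q, n_ab // q
    return {
        "r": r,                      # horizontal stripes of A (block rows of C)
        "s": s,                      # vertical stripes of B (block columns of C)
        "t": t,                      # blocks per stripe (inner dimension)
        "c_blocks": r * s,           # number of q x q blocks of C
        "block_updates": r * s * t,  # updates C_ij <- C_ij + A_ik * B_kj
    }


if __name__ == "__main__":
    print(block_partition(n_a=8000, n_ab=8000, n_b=64000, q=80))
```

The dimensions in the usage line are arbitrary illustrative values.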

2.2 Platform 

We target a star network $S = \{P_0, P_1, P_2, \ldots, P_p\}$, composed of a master $P_0$ and of p workers
$P_i$, $1 \le i \le p$ (see Figure 2). Because we manipulate large data blocks, we adopt a linear
cost model, both for computations and communications (i.e., we neglect start-up overheads).
We have the following notations:

• It takes $X \cdot w_i$ time units to execute a task of size X on $P_i$;

• It takes $X \cdot c_i$ time units for the master $P_0$ to send a message of size X to $P_i$ or to
receive a message of size X from $P_i$.

Our star platforms are thus fully heterogeneous, both in terms of computations and 
of communications. A fully homogeneous star platform would be a star platform with 
identical workers and identical communication links: $w_i = w$ and $c_i = c$ for each worker
$P_i$, $1 \le i \le p$. Without loss of generality, we assume that the master has no processing
capability (otherwise, add a fictitious extra worker paying no communication cost to simulate 
computation at the master). 

Next, we need to define the communication model. We adopt the one-port model [10, 11], 
which is defined as follows: 

• the master can only send data to, and receive data from, a single worker at a given
time-step,

• a given worker cannot start execution before it has terminated the reception of the
message from the master; similarly, it cannot start sending the results back to the
master before finishing the computation.

In fact, this one-port model naturally comes in two flavors with return messages, depending 
upon whether we allow the master to simultaneously send and receive messages or not. If 
we do allow for simultaneous sends and receives, we have the two-port model. Here we 
concentrate on the true one-port model, where the master cannot be enrolled in more than 
one communication at any time-step. 
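
A minimal sketch of how completion times follow from the linear cost model when the master's sends are serialized, as the one-port rule requires. The function `one_port_schedule` is hypothetical, and the sketch makes two simplifying assumptions that go beyond the text: result messages are ignored, and a worker may receive its next message while it is still computing on the previous one.

```python
def one_port_schedule(messages, c, w):
    """Completion times under a one-port master with linear costs.

    messages: list of (worker_id, size) in the order the master sends them.
    c[i], w[i]: communication and computation cost per unit of size for worker i.
    Returns a list of (send_end, compute_end) pairs, one per message.

    Assumptions of this sketch: results are not returned, and a worker can
    receive the next message while computing on the current one.
    """
    master_free = 0   # time at which the master's single port is available again
    worker_free = {}  # time at which each worker finishes its current task
    timeline = []
    for worker, size in messages:
        send_end = master_free + size * c[worker]   # sends are serialized
        master_free = send_end
        # A task starts only once its message is fully received
        # and the worker has finished its previous task.
        start = max(send_end, worker_free.get(worker, 0))
        compute_end = start + size * w[worker]
        worker_free[worker] = compute_end
        timeline.append((send_end, compute_end))
    return timeline


if __name__ == "__main__":
    costs_c = {1: 1, 2: 3}
    costs_w = {1: 2, 2: 1}
    print(one_port_schedule([(1, 2), (2, 1), (1, 1)], costs_c, costs_w))
```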

The one-port model is realistic. Bhat, Raghavendra, and Prasanna [10, 11] advocate its 
use because "current hardware and software do not easily enable multiple messages to be 
transmitted simultaneously." Even if non-blocking multi-threaded communication libraries 
allow for initiating multiple send and receive operations, they claim that all these opera- 
tions "are eventually serialized by the single hardware port to the network." Experimental 
evidence of this fact has recently been reported by Saif and Parashar [35], who observe that
asynchronous MPI sends get serialized as soon as message sizes exceed a hundred kilobytes.
Their results hold for two popular MPI implementations, MPICH on Linux clusters and IBM
MPI on the SP2. Note that all the MPI experiments in Section 8 obey the one-port model. 

The one-port model fully accounts for the heterogeneity of the platform, as each link has 
a different bandwidth. It generalizes a simpler model studied by Banikazemi, Moorthy, and 
Panda [1], Liu [32], and Khuller and Kim [30]. In this simpler model, the communication
time only depends on the sender, not on the receiver. In other words, the communication 
speed from a processor to all its neighbors is the same. This would restrict the study to bus 
platforms instead of general star platforms. 

Our final assumption is related to memory capacity; we assume that a worker $P_i$ can only
store $m_i$ blocks (either from A, B, or C). For large problems, this memory limitation will
considerably impact the design of the algorithms, as data re-use will be greatly dependent 
on the amount of available buffers. 

3 Combinatorial complexity of a simple version of the 
problem 

This section is almost a digression; it is devoted to the study of the simplest variant of the 
problem. It is intended to show the intrinsic combinatorial difficulty of the problem. We 
make the following simplifications: 

• We target a fully homogeneous platform (identical workers and communication links). 

• We consider only rank-one block updates; in other words, and with previous notations, 
we focus on the case where t = 1. 

• Results need not be returned to the master.

• Workers have no memory limitation; they receive each stripe only once and can re-use
them for other computations.

There are five parameters in the problem; three platform parameters (c, w, and the
number of workers p) and two application parameters (r and s). The scheduling problem
amounts to deciding which files should be sent to which workers and in which order. A
given file may well be sent several times, to further distribute computations. For instance, a
simple strategy is to partition A and to duplicate B, i.e., send each block $A_i$ only once and
each block $B_j$ p times; all workers would then be able to work fully in parallel.




Figure 3: Dependence graph of the problem (with r = 3 and s = 2). 

The dependence graph of the problem is depicted in Figure 3. It suggests a natural 
strategy for enabling workers to start computing as soon as possible. Indeed, the master 
should alternate sending A-blocks and B-blocks. Of course it must be decided how many
workers to enroll and in which order to send the blocks to the enrolled workers. But with a 
single worker, we can show that the alternating greedy algorithm is optimal: 

Proposition 1. With a single worker, the alternating greedy algorithm is optimal. 

Proof. In this algorithm, the master sends blocks as soon as possible, alternating a block 
of type A and a block of type B (and proceeds with the remaining blocks when one type 
is exhausted). This strategy maximizes at each step the total number of tasks that can be 
processed by the worker. To see this, after x communication steps, with y files of type A
sent, and z files of type B sent, where $y + z = x$, the worker can process at most $y \times z$ tasks.
The greedy algorithm enforces $y = \lceil x/2 \rceil$ and $z = \lfloor x/2 \rfloor$ (as long as $\max(y, z) \le \min(r, s)$, and
then sends the remaining files), hence its optimality. □
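
The counting argument in the proof can be checked by brute force on small instances. The sketch below is only an illustration: the helper names `alternating_split` and `best_split` are hypothetical, and it compares the number of tasks enabled by the alternating order after every prefix of x communications with the best possible split of x files between the two types.

```python
def alternating_split(x: int, r: int, s: int) -> tuple:
    """Files of type A and B sent after x steps of the alternating greedy order.

    The master alternates A, B, A, B, ... and, once one type is exhausted,
    keeps sending the other type. Requires 0 <= x <= r + s.
    """
    if not 0 <= x <= r + s:
        raise ValueError("x must lie between 0 and r + s")
    y = z = 0
    for _ in range(x):
        # Send an A file when it is A's turn and some remain, or when B is exhausted.
        if (y <= z and y < r) or z == s:
            y += 1
        else:
            z += 1
    return y, z


def best_split(x: int, r: int, s: int) -> int:
    """Maximum of y * z over all feasible splits y + z = x, y <= r, z <= s."""
    return max(y * (x - y) for y in range(max(0, x - s), min(r, x) + 1))


if __name__ == "__main__":
    for x in range(0, 6):
        y, z = alternating_split(x, r=3, s=2)
        print(x, (y, z), y * z, best_split(x, 3, 2))
```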

Unfortunately, for more than one worker, we did not succeed in determining an optimal 
algorithm. There are (at least) two greedy algorithms that can be devised for p workers: 

Thrifty: This algorithm "spares" resources as it aims at keeping each enrolled worker fully 
active. It works as follows: 

• Send enough blocks to the first worker so that it is never idle. 



RR n° 0123456789 






J. Dongarra, J.-F. Pineau, Y. Robert, Z. Shi, F. Vivien 



Figure 4: Neither Thrifty nor Min-min is optimal: (a) with p = 2, c = 4, w = 7, and
r = s = 3, Min-min wins; (b) with p = 2, c = 8, w = 9, r = 6, and s = 3, Thrifty wins.



• Send blocks to a second worker during spare communication slots, and 

• Enroll a new worker (and send blocks to it) only if this does not delay previously
enrolled workers. 

Min-min: This algorithm is based on the well-known min-min heuristic [33]. At each step, 
all tasks are considered. For each of them, we compute their possible starting date on 
each worker, given the files that have already been sent to this worker and all decisions 
taken previously; we select the best worker, hence the first min in the heuristic. We 
take the minimum of starting dates over all tasks, hence the second min. 

It turns out that neither greedy algorithm is optimal. See Figure 4(a) for an example 
where Min-min is better than Thrifty, and Figure 4(b) for an example of the opposite 
situation. 

We now go back to our original model. 



4 Minimization of the communication volume 

In this section, we derive a lower bound on the total number of communications (sent from, 
or received by, the master) that are needed to execute any matrix multiplication algorithm. 
We point out that, since we are not interested in optimizing the execution time (a difficult 
problem, according to Section 3) but only in minimizing the total communication volume, we 
can simulate any parallel algorithm on a single worker. Therefore, we only need to consider 
the one-worker case.

We deal with the original, and realistic, formulation of the problem as follows: 

• The master sends blocks $A_{i,k}$, $B_{k,j}$, and $C_{i,j}$,

• The master retrieves final values of blocks $C_{i,j}$, and

• We enforce limited memory on the worker; only m buffers are available, which means
that at most m blocks of A, B, and/or C can simultaneously be stored on the worker.

Figure 5: Memory usage for the maximum re-use algorithm when m = 21: $\mu = 4$; 1 block
is used for A, $\mu$ for B, and $\mu^2$ for C.

Figure 6: Four steps of the maximum re-use algorithm, with m = 21 and $\mu = 4$. The
elements of C updated are displayed in white on black.

First, we describe an algorithm that aims at re-using C blocks as much as possible after
they have been loaded. Next, we assess the performance of this algorithm. Finally, we 
improve a lower bound previously established by Toledo [38, 27]. 

4.1 The maximum re-use algorithm 

Below we introduce and analyze the performance of the maximum re-use algorithm, whose 
memory management is illustrated in Figure 5. Four consecutive execution steps are shown 
in Figure 6. Assume that there are m available buffers. First we find $\mu$ as the largest integer
such that $1 + \mu + \mu^2 \le m$. The idea is to use one buffer to store A blocks, $\mu$ buffers to
store B blocks, and $\mu^2$ buffers to store C blocks. In the outer loop of the algorithm, a $\mu \times \mu$
square of C blocks is loaded. Once these $\mu^2$ blocks have been loaded, they are repeatedly
updated in the inner loop of the algorithm until their final value is computed. Then the
blocks are returned to the master, and $\mu^2$ new C blocks are sent by the master and stored by
the worker. As illustrated in Figure 5, we need $\mu$ buffers to store a row of B blocks, but only
one buffer for A blocks: A blocks are sent in sequence, each of them is used in combination
with a row of $\mu$ B blocks to update the corresponding row of C blocks. This leads to the
following sketch of the algorithm:

Outer loop: while there remain C blocks to be computed

• Store $\mu^2$ blocks of C in worker's memory:
send a $\mu \times \mu$ square $\{C_{i,j} \ / \ i_0 \le i < i_0 + \mu, \ j_0 \le j < j_0 + \mu\}$

• Inner loop: For each k from 1 to t:

1. Send a row of $\mu$ elements $\{B_{k,j} \ / \ j_0 \le j < j_0 + \mu\}$;

2. Sequentially send $\mu$ elements of column $\{A_{i,k} \ / \ i_0 \le i < i_0 + \mu\}$. For each $A_{i,k}$,
update $\mu$ elements of C

• Return results to master.
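
The sketch above can be simulated in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: each "block" is a plain number (i.e., q = 1), the names `mu_for_memory` and `max_reuse` are hypothetical, and partial squares at the matrix border are simply handled by shorter ranges. The simulation counts every block that crosses the master-worker link.

```python
def mu_for_memory(m: int) -> int:
    """Largest integer mu such that 1 + mu + mu**2 <= m."""
    if m < 3:
        raise ValueError("need at least 3 buffers (one each for A, B, C)")
    mu = 1
    while 1 + (mu + 1) + (mu + 1) ** 2 <= m:
        mu += 1
    return mu


def max_reuse(A, B, C, m):
    """Simulate the maximum re-use scheme on scalar 'blocks' (q = 1).

    A is r x t, B is t x s, C is r x s (lists of lists).
    Returns (updated copy of C, number of blocks communicated).
    """
    r, t, s = len(A), len(B), len(B[0])
    mu = mu_for_memory(m)
    result = [row[:] for row in C]
    comms = 0
    for i0 in range(0, r, mu):            # outer loop over mu x mu squares of C
        for j0 in range(0, s, mu):
            rows = range(i0, min(i0 + mu, r))
            cols = range(j0, min(j0 + mu, s))
            # Master -> worker: the square of C blocks (at most mu**2 buffers).
            c_buf = {(i, j): result[i][j] for i in rows for j in cols}
            comms += len(c_buf)
            for k in range(t):            # inner loop
                b_buf = {j: B[k][j] for j in cols}   # row of B (at most mu buffers)
                comms += len(b_buf)
                for i in rows:
                    a = A[i][k]           # single buffer for A, sent in sequence
                    comms += 1
                    for j in cols:
                        c_buf[i, j] += a * b_buf[j]
            # Worker -> master: final values of the square.
            for (i, j), value in c_buf.items():
                result[i][j] = value
            comms += len(c_buf)
    return result, comms


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C = [[0, 0], [0, 0]]
    print(max_reuse(A, B, C, m=7))
```

When r and s are multiples of $\mu$, the counter equals $(r/\mu)(s/\mu)(2\mu^2 + 2\mu t)$, which is the per-square cost analyzed in the next subsection.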

4.2 Performance and lower bound 

The performance of one iteration of the outer loop of the maximum re-use algorithm can 
readily be determined: 

• We need $2\mu^2$ communications to send and retrieve C blocks.

• For each of the t iterations of the inner loop:

- we need $\mu$ elements of A and $\mu$ elements of B;

- we update $\mu^2$ blocks.

In terms of block operations, the communication-to-computation ratio achieved by the
algorithm is thus

$$\mathrm{CCR} = \frac{2\mu^2 + 2\mu t}{\mu^2 t} = \frac{2}{t} + \frac{2}{\mu}.$$

For large problems, i.e., large values of t, we see that CCR is asymptotically close to the
value $\mathrm{CCR}_\infty = 2/\mu$. We point out that, in terms of data elements, the communication-to-
computation ratio is divided by a factor q. Indeed, a block consists of $q^2$ coefficients but an
update requires $q^3$ floating-point operations.
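
To make the ratio concrete, here is a small exact computation with Python fractions. The function name `ccr` is hypothetical; the counts are the ones listed above for one iteration of the outer loop ($2\mu^2$ blocks of C exchanged, $2\mu t$ blocks of A and B sent, $\mu^2 t$ block updates).

```python
from fractions import Fraction


def ccr(mu: int, t: int) -> Fraction:
    """Communication-to-computation ratio (in blocks) of one outer-loop iteration."""
    communications = 2 * mu**2 + 2 * mu * t   # C in and out, plus A and B blocks
    updates = mu**2 * t                       # block updates performed
    return Fraction(communications, updates)


if __name__ == "__main__":
    for t in (10, 100, 1000):
        # The ratio approaches 2/mu as t grows.
        print(t, ccr(4, t), float(ccr(4, t)))
```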

How can we assess the performance of the maximum re-use algorithm? How good is the 
value of CCR? To see this, we refine an analysis due to Toledo [38]. The idea is to estimate
the number of computations made thanks to m consecutive communication steps (again, the
unit is a matrix block here). We need some notations:

• We let $\alpha_{old}$, $\beta_{old}$, and $\gamma_{old}$ be the number of buffers dedicated to A, B, and C at the
beginning of the m communication steps;

• We let $\alpha_{recv}$, $\beta_{recv}$, and $\gamma_{recv}$ be the number of A, B, and C blocks sent by the master
during the m communication steps;

• Finally, we let $\gamma_{send}$ be the number of C blocks returned to the master during these m
steps.

Obviously, the following equations must hold true:

$$\alpha_{old} + \beta_{old} + \gamma_{old} \le m$$

$$\alpha_{recv} + \beta_{recv} + \gamma_{recv} + \gamma_{send} = m$$

The following lemma is given in [38]: consider any algorithm that uses the standard way
of multiplying matrices (this excludes Strassen's or Winograd's algorithm [19], for instance).
If $N_A$ elements of A, $N_B$ elements of B and $N_C$ elements of C are accessed, then no more
than K computations can be done, where

$$K = \min\left\{ (N_A + N_B)\sqrt{N_C},\ (N_A + N_C)\sqrt{N_B},\ (N_B + N_C)\sqrt{N_A} \right\}.$$



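To use this result here, we see that no more than $\alpha_{old} + \alpha_{recv}$ blocks of A are accessed, hence
$N_A = (\alpha_{old} + \alpha_{recv}) q^2$. Similarly, $N_B = (\beta_{old} + \beta_{recv}) q^2$ and $N_C = (\gamma_{old} + \gamma_{recv}) q^2$ (the C
blocks returned are already counted). We simplify notations by writing:

$$\alpha_{old} + \alpha_{recv} = \alpha m, \qquad \beta_{old} + \beta_{recv} = \beta m, \qquad \gamma_{old} + \gamma_{recv} = \gamma m.$$

Then we obtain

$$K = \min\left\{ (\alpha + \beta)\sqrt{\gamma},\ (\beta + \gamma)\sqrt{\alpha},\ (\gamma + \alpha)\sqrt{\beta} \right\} \times m\sqrt{m}\, q^3.$$

Writing $K = k\, m\sqrt{m}\, q^3$, we obtain the following system of equations

Maximize k s.t.
$k \le (\alpha + \beta)\sqrt{\gamma}$
$k \le (\beta + \gamma)\sqrt{\alpha}$
$k \le (\gamma + \alpha)\sqrt{\beta}$
$\alpha + \beta + \gamma \le 2$

whose solution is easily found to be

$$\alpha = \beta = \gamma = \frac{2}{3}, \quad \text{and} \quad k = \sqrt{\frac{32}{27}}.$$

This gives a lower bound for the communication-to-computation ratio (in terms of blocks)
of any algorithm:

$$\mathrm{CCR}_{opt} = \frac{m}{k\, m\sqrt{m}} = \sqrt{\frac{27}{32m}}.$$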



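In fact, it is possible to refine this bound. Instead of using the lemma given in [38],
we use the Loomis-Whitney inequality [27]: if $N_A$ elements of A, $N_B$ elements of B, and $N_C$
elements of C are accessed, then no more than K computations can be done, where

$$K = \sqrt{N_A N_B N_C}.$$

Here

$$K = \sqrt{\alpha\beta\gamma} \times m\sqrt{m}\, q^3,$$

which, under the same constraint $\alpha + \beta + \gamma \le 2$, is maximized for $\alpha = \beta = \gamma = 2/3$, so that
$k = \sqrt{8/27}$ and

$$\mathrm{CCR}_{opt} = \sqrt{\frac{27}{8m}}.$$

The maximum re-use algorithm does not achieve the lower bound:

$$\mathrm{CCR}_\infty = \frac{2}{\mu} \approx \frac{2}{\sqrt{m}} = \sqrt{\frac{32}{8m}},$$

but it is quite close!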



Finally, we point out that the bound $\mathrm{CCR}_{opt}$ improves upon the best-known value $\sqrt{1/(8m)}$
derived in [27]. Also, the ratio $\mathrm{CCR}_\infty$ achieved by the maximum re-use algorithm is lower
by a factor $\sqrt{3}$ than the ratio achieved by the blocked matrix-multiply algorithm of [38].
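
The two optimization problems behind these bounds are small enough to check numerically. The following sketch is an illustration only: it performs a coarse grid search over the simplex $\alpha + \beta + \gamma = 2$ (the function names and the grid resolution are assumptions of the example) and converts the best normalized count k into a lower bound $1/(k\sqrt{m})$ on the ratio, following the derivation $\mathrm{CCR} \ge m / (k\, m \sqrt{m})$.

```python
import math


def toledo_k(a: float, b: float, c: float) -> float:
    """Normalized bound from the lemma of [38]."""
    return min((a + b) * math.sqrt(c), (b + c) * math.sqrt(a), (c + a) * math.sqrt(b))


def loomis_whitney_k(a: float, b: float, c: float) -> float:
    """Normalized bound from the Loomis-Whitney inequality."""
    return math.sqrt(a * b * c)


def grid_max(bound, steps: int = 60, total: float = 2.0) -> float:
    """Maximize bound(a, b, c) over a grid of positive points with a + b + c = total."""
    best = 0.0
    for i in range(1, steps):
        for j in range(1, steps - i):
            a = total * i / steps
            b = total * j / steps
            c = total * (steps - i - j) / steps
            best = max(best, bound(a, b, c))
    return best


def ccr_lower_bound(k: float, m: int) -> float:
    """Lower bound m / (k * m * sqrt(m)) on the block-level ratio."""
    return 1.0 / (k * math.sqrt(m))


if __name__ == "__main__":
    m = 21
    for name, bound in (("lemma of [38]", toledo_k), ("Loomis-Whitney", loomis_whitney_k)):
        k = grid_max(bound)
        print(name, k, ccr_lower_bound(k, m))
```

With steps = 60 the grid contains the symmetric point $\alpha = \beta = \gamma = 2/3$, which is where both maxima are attained.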

5 Algorithms for homogeneous platforms 

In this section, we adapt the maximum re-use algorithm to fully homogeneous platforms. 
In this framework, contrary to the simplest version, we have a limitation of the memory 
capacity. So we must first decide which part of the memory will be used to store which
part of the original matrices, in order to maximize the total number of computations per 
time unit. Cannon's algorithm [14] and the ScaLAPACK outer product algorithm [13] both 
distribute square blocks of C to the processors. Intuitively, squares are better than elongated 
rectangles because their perimeter (which is proportional to the communication volume) is 
smaller for the same area. We use the same approach here, but we have not been able to 
establish any optimality result.

Principle of the algorithm 

We load into the memory of each worker $\mu$ $q \times q$ blocks of A and $\mu$ $q \times q$ blocks of B to
compute $\mu^2$ $q \times q$ blocks of C. In addition, we need $2\mu$ extra buffers, split into $\mu$ buffers
for A and $\mu$ for B, in order to overlap computation and communication steps. In fact, $\mu$
buffers for A and $\mu$ for B would suffice for each update, but we need to prepare for the next
update while computing. Overall, the number of C blocks that we can simultaneously load
into memory is $\mu^2$, where $\mu$ is the largest integer such that

$$\mu^2 + 4\mu \le m.$$
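
Since $\mu^2 + 4\mu \le m$ is equivalent to $(\mu + 2)^2 \le m + 4$, the largest admissible value is $\mu = \lfloor \sqrt{m + 4} \rfloor - 2$. A short sketch in exact integer arithmetic (the function name `homogeneous_mu` is hypothetical):

```python
import math


def homogeneous_mu(m: int) -> int:
    """Largest integer mu with mu**2 + 4*mu <= m, for m >= 0 buffers.

    Uses (mu + 2)**2 <= m + 4, so mu = isqrt(m + 4) - 2 without any rounding error.
    """
    if m < 0:
        raise ValueError("the number of buffers must be non-negative")
    return math.isqrt(m + 4) - 2


if __name__ == "__main__":
    for m in (5, 21, 32, 100):
        mu = homogeneous_mu(m)
        print(m, mu, mu**2 + 4 * mu)
```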
We have to determine the number of participating workers $\mathfrak{P}$. For that purpose, we
proceed as follows. On the communication side, we know that in a round (computing a
C block entirely), the master exchanges with each worker $2\mu^2$ blocks of C ($\mu^2$ sent and $\mu^2$
received), and sends $\mu t$ blocks of A and $\mu t$ blocks of B. Also during this round, on the
computation side, each worker computes $\mu^2 t$ block updates.

If we enroll too many processors, the communication capacity of the master will be 
exceeded. There is a limit on the number of blocks sent per time unit, hence on the maximal 
processor number $\mathfrak{P}$, which we compute as follows: $\mathfrak{P}$ is the smallest integer such that

$$2\mu t c \times \mathfrak{P} \ge \mu^2 t w.$$

Indeed, this is the smallest value to saturate the communication capacity of the master 
required to sustain the corresponding computations. We derive that 



' fi^tw' 




' flW 


2/itc 




2c 



Mis 

2 T, 



z/iic zc 

In the context of matrix multiplication, we have c = q^Tc and w — q^Ta, hence *P 
. Moreover, we need to enforce that *p < p, hence we finally obtain the formula 



*P = min < p. 



2 Te 
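As a minimal sketch of this resource selection, the chunk size and the number of enrolled workers can be computed from the memory size, the platform size, and the two unit costs. The function name `homogeneous_selection` and its argument names are illustrative assumptions, not part of the paper.

```python
from math import ceil, isqrt

def homogeneous_selection(m, p, c, w):
    """Return (mu, workers) for a homogeneous platform.

    m: number of q x q buffers per worker
    p: number of available workers
    c: time to send one block, w: time for one block update
    """
    # Largest mu with mu^2 + 4*mu <= m, i.e. (mu + 2)^2 <= m + 4.
    mu = isqrt(m + 4) - 2
    # Smallest worker count that saturates the master, capped by p.
    workers = min(p, ceil(mu * w / (2 * c)))
    return mu, workers

print(homogeneous_selection(m=32, p=8, c=2, w=4.5))
```

With 32 buffers per worker the sketch selects chunks of side 4, and with c = 2 and w = 4.5 it enrolls five of the eight workers.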

For the sake of simplicity, we suppose that $r$ is divisible by $\mu$ and that $s$ is divisible by
$\mathfrak{P}\mu$. We allocate $\mu$ block columns (i.e., $q\mu$ consecutive columns of the original matrix) of C
to each processor. The algorithm is decomposed into two parts. Algorithm 1 outlines the
program of the master, while Algorithm 2 is the program of each worker.
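The order in which the master of Algorithm 1 issues its messages can be sketched as a generator. This is only an illustration under the divisibility assumptions stated above; the message tuples and the name `master_schedule` are hypothetical, and no actual communication is performed.

```python
from collections import Counter

def master_schedule(r, s, t, mu, workers):
    """Yield the master's messages in the order of Algorithm 1.

    r, s, t : sizes in q x q blocks (C is r x s, A is r x t, B is t x s)
    mu      : side of the square chunk of C held by each worker
    workers : number of enrolled workers
    """
    assert r % mu == 0 and s % (workers * mu) == 0
    for j_base in range(0, s // mu, workers):      # one chunk column per worker
        for i_chunk in range(1, r // mu + 1):      # chunk rows
            for wid in range(1, workers + 1):
                yield ("send_C", wid, i_chunk, j_base + wid)
            for k in range(1, t + 1):
                for wid in range(1, workers + 1):
                    j_chunk = j_base + wid
                    for j in range((j_chunk - 1) * mu + 1, j_chunk * mu + 1):
                        yield ("send_B", wid, k, j)
                    for i in range((i_chunk - 1) * mu + 1, i_chunk * mu + 1):
                        yield ("send_A", wid, i, k)
            for wid in range(1, workers + 1):
                yield ("recv_C", wid, i_chunk, j_base + wid)

counts = Counter(kind for kind, *_ in master_schedule(r=2, s=4, t=3, mu=2, workers=2))
print(dict(counts))
```

Counting the messages per kind is a quick way to check the communication volumes used in the analysis: each worker receives $\mu t$ blocks of A and $\mu t$ blocks of B per chunk of C.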

Impact of the start-up overhead 

If we follow the execution of the homogeneous algorithm, we may wonder whether we can
really neglect the input/output of C blocks. Contrary to the greedy algorithms for the
simplest instance described in Section 3, we sequentialize here the sending, computing, and
receiving of the C blocks, so that each worker loses $2c$ time-units per block, i.e., per $tw$
time-units. As there are $\mathfrak{P} \le \frac{\mu w}{2c} + 1$ workers, the total loss would be of $2c\mathfrak{P}$ time-units
every $tw$ time-units, which is less than $\frac{\mu}{t} + \frac{2c}{tw}$. For example, with $c = 2$, $w = 4.5$, $\mu = 4$ and
$t = 100$, we enroll $\mathfrak{P} = 5$ workers, and the total loss is about 4.4%, which is small enough to
be neglected. Note that it would be technically possible to design an algorithm where the
sending of the next block is overlapped with the last computations of the current block, but
the whole procedure gets much more complicated.
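The numerical example can be checked with a few lines of Python. This is a sketch; `startup_loss` is an illustrative name, and the loss is expressed as a fraction of the $tw$ time-units spent computing one block.

```python
from math import ceil

def startup_loss(c, w, mu, t):
    """Return (workers, loss, bound) for the I/O of C blocks."""
    workers = ceil(mu * w / (2 * c))      # enrolled workers (platform assumed large enough)
    loss = 2 * c * workers / (t * w)      # 2c per worker, every t*w time-units
    bound = mu / t + 2 * c / (t * w)      # upper bound from workers <= mu*w/(2c) + 1
    return workers, loss, bound

workers, loss, bound = startup_loss(c=2, w=4.5, mu=4, t=100)
print(workers, round(loss, 4), round(bound, 4))
```

For these parameters the loss stays between 4% and 5% and is always below the bound.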



Dealing with "small" matrices or platforms 



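We have shown that our algorithm should use $\mathfrak{P} = \min\left\{p, \left\lceil \frac{\mu w}{2c} \right\rceil\right\}$ processors, each of
them holding $\mu^2$ blocks of matrix C. For this solution to be feasible, C must be large enough.

Algorithm 1: Homogeneous version, master program.

    $\mu \leftarrow \lfloor \sqrt{4 + m} - 2 \rfloor$;
    $\mathfrak{P} \leftarrow \min\{p, \lceil \mu w / (2c) \rceil\}$;
    Split the matrix into squares $C_{i',j'}$ of $\mu^2$ blocks (of size $q \times q$):
        $C_{i',j'} = \{C_{i,j} \mid (i'-1)\mu + 1 \le i \le i'\mu,\ (j'-1)\mu + 1 \le j \le j'\mu\}$;
    for $j'' \leftarrow 0$ to $s/\mu - \mathfrak{P}$ by step $\mathfrak{P}$ do
        for $i' \leftarrow 1$ to $r/\mu$ do
            for $id_{worker} \leftarrow 1$ to $\mathfrak{P}$ do
                $j' \leftarrow j'' + id_{worker}$;
                Send block $C_{i',j'}$ to worker $id_{worker}$;
            for $k \leftarrow 1$ to $t$ do
                for $id_{worker} \leftarrow 1$ to $\mathfrak{P}$ do
                    $j' \leftarrow j'' + id_{worker}$;
                    for $j \leftarrow (j'-1)\mu + 1$ to $j'\mu$ do
                        Send $B_{k,j}$;
                    for $i \leftarrow (i'-1)\mu + 1$ to $i'\mu$ do
                        Send $A_{i,k}$;
            for $id_{worker} \leftarrow 1$ to $\mathfrak{P}$ do
                $j' \leftarrow j'' + id_{worker}$;
                Receive $C_{i',j'}$ from worker $id_{worker}$;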

Algorithm 2: Homogeneous version, worker program.

    for all blocks do
        Receive $C_{i',j'}$ from master;
        for $k \leftarrow 1$ to $t$ do
            for $j \leftarrow (j'-1)\mu + 1$ to $j'\mu$ do
                Receive $B_{k,j}$;
            for $i \leftarrow (i'-1)\mu + 1$ to $i'\mu$ do
                Receive $A_{i,k}$;
                for $j \leftarrow (j'-1)\mu + 1$ to $j'\mu$ do
                    $C_{i,j} \leftarrow C_{i,j} + A_{i,k} \cdot B_{k,j}$;
        Return $C_{i',j'}$ to master;



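In other words, this solution can be implemented if and only if $r \times s \ge \min\left\{p, \left\lceil \frac{\mu w}{2c} \right\rceil\right\} \mu^2$.

If C is not large enough, we will only use $\mathfrak{Q} < \mathfrak{P}$ processors, each of them holding $\nu^2$ blocks
of C, such that:

$$\mathfrak{Q}\nu^2 \le r \times s \ \text{ and } \ \nu^2 w \le 2\nu\mathfrak{Q}c, \qquad \text{i.e.,} \qquad \mathfrak{Q}\nu^2 \le r \times s \ \text{ and } \ \frac{\nu w}{2c} \le \mathfrak{Q},$$

following the same line of reasoning as previously. We obviously want $\nu$ to be the largest
possible in order for the communications to be most beneficial. For a given value of $\nu$ we
want $\mathfrak{Q}$ to be the smallest to spare resources. Therefore, the best solution is given by the
largest value of $\nu$ such that:

$$\left\lceil \frac{\nu w}{2c} \right\rceil \nu^2 \le r \times s,$$

and then $\mathfrak{Q} = \left\lceil \frac{\nu w}{2c} \right\rceil$.

If the platform does not contain the desired number of processors, i.e., if $\mathfrak{P} > p$ in the
case of a "large" matrix C or if $\mathfrak{Q} > p$ otherwise, then we enroll all the $p$ processors and we
give them $\nu^2$ blocks of C with $\nu = \min\left\{ \left\lfloor \sqrt{\frac{r \times s}{p}} \right\rfloor, \left\lfloor \frac{2c}{w} p \right\rfloor \right\}$, following the same line of reasoning as
previously.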
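The search for the chunk size on a "small" matrix can be sketched as a simple scan. The function name is illustrative, and the cap of the search at $\mu$ (the memory limit) is an assumption of this sketch; the platform is assumed to hold enough workers.

```python
from math import ceil

def small_matrix_selection(r, s, c, w, mu):
    """Return (nu, workers) with the largest nu such that
    ceil(nu*w/(2c)) * nu^2 <= r*s, or None if no chunk size fits."""
    best = None
    for nu in range(1, mu + 1):
        workers = ceil(nu * w / (2 * c))
        if workers * nu * nu <= r * s:
            best = (nu, workers)          # feasibility is monotone: keep the largest nu
    return best

print(small_matrix_selection(r=4, s=5, c=2, w=4.5, mu=4))
```

With only 20 blocks of C, chunks of side 4 on five workers do not fit, and the scan falls back to chunks of side 2 on three workers.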



6 Algorithms for heterogeneous platforms 

In this section, all processors are heterogeneous, in terms of memory size as well as
computation or communication time. As in the previous section, $m_i$ is the number of $q \times q$ blocks
that fit in the memory of worker $P_i$, and we need to load into the memory of $P_i$ $2\mu_i$ blocks
of A, $2\mu_i$ blocks of B, and $\mu_i^2$ blocks of C. This number of blocks loaded into the memory
changes from worker to worker, because it depends upon their memory capacities. We first
compute all the different values of $\mu_i$ so that

$$\mu_i^2 + 4\mu_i \le m_i.$$

To adapt our maximum re-use algorithm to heterogeneous platforms, we first design a 
greedy algorithm for resource selection (Section 6.1), and we discuss its limitations. We 
introduce our final algorithm for heterogeneous platforms in Section 6.2. 



6.1 Bandwidth-centric resource selection 

Each worker $P_i$ has parameters $c_i$, $w_i$, and $\mu_i$, and each participating $P_i$ needs to receive
$\delta_i = 2\mu_i t c_i$ blocks to perform $\phi_i = t \mu_i^2 w_i$ computations. Once again, we neglect I/O for C
blocks. Consider the steady-state of a schedule. During one time-unit, $P_i$ receives a certain
amount $y_i$ of blocks, both of A and B, and computes $x_i$ C blocks. We express the constraints
in terms of communication (the master has limited bandwidth) and of computation (a
worker cannot perform more work than it receives). The objective is to maximize the amount
of work performed per time-unit. Altogether, we gather the following linear program:

$$\begin{array}{ll}
\text{Maximize} & \sum_i x_i \\
\text{subject to} & \sum_i y_i c_i \le 1 \\
 & \forall i,\ x_i w_i \le 1 \\
 & \forall i,\ \dfrac{x_i}{\mu_i^2} \le \dfrac{y_i}{2\mu_i}
\end{array}$$

Obviously, the best solution for $y_i$ is $y_i = \frac{2x_i}{\mu_i}$, so the problem can be reduced to

$$\begin{array}{ll}
\text{Maximize} & \sum_i x_i \\
\text{subject to} & \forall i,\ x_i \le \dfrac{1}{w_i} \\
 & \sum_i \dfrac{2c_i}{\mu_i} x_i \le 1
\end{array}$$

The optimal solution for this system is a bandwidth-centric strategy [8, 3]; we sort workers
by non-decreasing values of $\frac{2c_i}{\mu_i}$ and we enroll them as long as $\sum \frac{2c_i}{\mu_i w_i} \le 1$. In this way, we
can achieve the throughput $\rho \approx \sum_{i \text{ enrolled}} \frac{1}{w_i}$.

Table 1: Platform for which the bandwidth-centric solution is not feasible.

                              $P_1$    $P_2$
    $c_i$                       1       20
    $w_i$                       2       40
    $\mu_i$                     2        2
    $2c_i / (\mu_i w_i)$       1/2      1/2

Table 2: Platform used to demonstrate the processor selection algorithms.

                      $P_1$    $P_2$    $P_3$
    $c_i$               2        3        5
    $w_i$               2        3        1
    $\mu_i$             6       18       10
    $\mu_i^2$          36      324      100
    $2\mu_i c_i$       24      108      100
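The reduced linear program has the structure of a fractional knapsack, so a greedy pass solves it. The sketch below is an illustration rather than the authors' code: `bandwidth_centric` is a hypothetical name, and the last worker considered may be used only partially, which corresponds to the fractional optimum of the linear program.

```python
def bandwidth_centric(workers):
    """Greedy solution of the reduced linear program.

    workers: list of (name, c_i, w_i, mu_i) tuples.
    Returns (x, throughput), where x[name] is the number of C block
    updates performed by that worker per time-unit.
    """
    remaining = 1.0                       # share of the master's time still unused
    x = {}
    # Cheapest communication per block update first: sort by 2*c_i/mu_i.
    for name, c, w, mu in sorted(workers, key=lambda p: 2 * p[1] / p[3]):
        cost = 2 * c / mu                 # master time needed per unit of x_i
        x[name] = min(1 / w, remaining / cost)
        remaining = max(0.0, remaining - x[name] * cost)
    return x, sum(x.values())

table1 = [("P1", 1, 2, 2), ("P2", 20, 40, 2)]
table2 = [("P1", 2, 2, 6), ("P2", 3, 3, 18), ("P3", 5, 1, 10)]
x1, rho1 = bandwidth_centric(table1)
x2, rho2 = bandwidth_centric(table2)
print(round(rho1, 3), round(rho2, 3))
```

On the platform of Table 2 the sketch yields a throughput of 25/18, about 1.39, which is the steady-state ratio quoted for this platform in Section 6.2.1.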

This solution seems to be close to the optimal. However, the problem is that workers 
may not have enough memory to execute it! Consider the example described by Table 1. 
Using the bandwidth-centric strategy, every 160 seconds: 

• $P_1$ receives 80 blocks (20 $\mu_1 \times \mu_1$ chunks) in 80 seconds, and computes 80 blocks in
160 seconds;

• $P_2$ receives 4 blocks (1 $\mu_2 \times \mu_2$ chunk) in 80 seconds, and computes 4 blocks in 160
seconds.

But $P_1$ computes too quickly, and it needs buffers to store as many as 20 blocks to stay
busy while one block is sent to $P_2$:

    Communications  11111111111111111111  20  11111111111111111111  20  111111111...
    Processor       P1                    P2  P1                    P2  P1 ...

Therefore, the bandwidth-centric solution cannot always be realized in practice, and we 
turn to another algorithm described below. To avoid the previous buffer problems, resource 
selection will be performed through a step-by-step simulation. However, we point out that 
the steady-state solution can be seen as an upper bound of the performance that can be 
achieved. 

6.2 Incremental resource selection 

The different memory capacities of the workers imply that we assign them chunks of different 
sizes. This requirement complicates the global partitioning of the C matrix among the 
workers. To take this into account and simplify the implementation, we decide to assign 
only full matrix column blocks in the algorithm. This is done in a two-phase approach. 

In the first phase we pre-compute the allocation of blocks to processors, using a processor
selection algorithm we will describe later. We start as if we had a huge matrix of size
$\infty \times \sum_{i=1}^{p} \mu_i$. Each time a processor $P_i$ is chosen by the processor selection algorithm it is
assigned a square chunk of $\mu_i^2$ C blocks. As soon as some processor $P_i$ has enough blocks
to fill up $\mu_i$ block columns of the initial matrix, we decide that $P_i$ will indeed execute these
columns during the parallel execution. Therefore we maintain a panel of $\sum_{i=1}^{p} \mu_i$ block
columns and fill them out by assigning blocks to processors. We stop this phase as soon
as all the $r \times s$ blocks of the initial matrix have been allocated columnwise by this process.
Note that worker $P_i$ will be assigned a block column after it has been selected $\lceil r/\mu_i \rceil$ times
by the algorithm.

In the second phase we perform the actual execution. Messages will be sent to workers
according to the previous selection process. The first time a processor $P_i$ is selected, it
receives a square chunk of $\mu_i^2$ C blocks, which initializes its repeated pattern of operation:
the following $t$ times, $P_i$ receives $\mu_i$ A and $\mu_i$ B blocks, which requires $2\mu_i c_i$ time-units.

There remains to decide which processor to select at each step. We have no closed-form 
formula for the allocation of blocks to processors. Instead, we use an incremental algorithm 
to compute which worker the next blocks will be assigned to. We have two variants of the 
incremental algorithm, a global one that aims at optimizing the overall communication-to- 
computation ratio, and a local one that selects the best processor for the next stage. Both 
variants are described below. 

6.2.1 Global selection algorithm 

The intuitive idea for this algorithm is to select the processor that maximizes the ratio of the
total work achieved so far (in terms of block updates) over the completion time of the last
communication. The latter represents the time spent by the master so far, either sending
data to workers or staying idle, waiting for the workers to finish their current computations.
We have:

$$\text{ratio} = \frac{\text{total work achieved}}{\text{completion time of last communication}}$$

Estimating computations is easy: $P_i$ executes $\mu_i^2$ block updates per assignment.
Communications are slightly more complicated to deal with; we cannot just use the communication
time $2\mu_i c_i$ of $P_i$ for the A and B blocks because we need to take its ready time into account.
Indeed, if $P_i$ is currently busy executing work, it cannot receive additional data too much in
advance because its memory is limited. Algorithm 3 presents this selection process, which
we iterate until all blocks of the initial matrix are assigned and computed.

Algorithm 3: Global selection algorithm.

    Data:
        completion-time: the completion time of the last communication
        ready_i: the completion time of the work assigned to processor $P_i$
        nb-block_i: the number of A and B blocks sent to processor $P_i$
        total-work: the total work assigned so far (in terms of block updates)
        nb-column: the number of fully processed C block columns

    INITIALIZATION
        completion-time $\leftarrow$ 0;
        total-work $\leftarrow$ 0;
        for $i \leftarrow 1$ to $p$ do
            ready_i $\leftarrow$ 0;
            nb-block_i $\leftarrow$ 0;

    SIMULATION
        repeat
            next $\leftarrow$ worker that realizes $\max_{1 \le i \le p} \frac{\text{total-work} + \mu_i^2}{\max(\text{completion-time} + 2\mu_i c_i,\ \text{ready}_i)}$;
            total-work $\leftarrow$ total-work $+\ \mu_{next}^2$;
            completion-time $\leftarrow \max(\text{completion-time} + 2\mu_{next} c_{next},\ \text{ready}_{next})$;
            ready_next $\leftarrow$ completion-time $+\ \mu_{next}^2 w_{next}$;
            nb-block_next $\leftarrow$ nb-block_next $+\ 2\mu_{next}$;
            nb-column $\leftarrow \sum_{i=1}^{p} \left\lfloor \frac{\text{nb-block}_i}{2\mu_i t \lceil r/\mu_i \rceil} \right\rfloor \mu_i$;
        until nb-column $\ge s$;

Running the global selection algorithm on an example. Consider the example
described in Table 2 with three workers $P_1$, $P_2$ and $P_3$. For the first step, we have
$\text{ratio}_i \leftarrow \frac{\mu_i^2}{2\mu_i c_i}$ for all $i$. We compute $\text{ratio}_1 = 1.5$, $\text{ratio}_2 = 3$, and $\text{ratio}_3 = 1$ and select $P_2$:
next $\leftarrow 2$. We update variables as total-work $\leftarrow 0 + 324 = 324$, completion-time $\leftarrow \max(0 + 108, 0) = 108$,
ready_2 $\leftarrow 108 + 972 = 1080$ and nb-block_2 $\leftarrow 36$.

At the second step we compute $\text{ratio}_1 \leftarrow \frac{324 + 36}{108 + 24} \approx 2.73$, $\text{ratio}_2 \leftarrow \frac{324 + 324}{1080} = 0.6$ and
$\text{ratio}_3 \leftarrow \frac{324 + 100}{108 + 100} \approx 2.04$, and we select $P_1$. We point out that $P_2$ is busy until time $t = 1080$
because of the first assignment, which we correctly took into account when computing ready_2.
For $P_1$ and $P_3$ the communication could take place immediately after the first one. There
remains to update variables: total-work $\leftarrow 324 + 36 = 360$, completion-time $\leftarrow \max(108 + 24, 0) = 132$,
ready_1 $\leftarrow 132 + 72 = 204$ and nb-block_1 $\leftarrow 12$.
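The first decisions of the global selection rule on the platform of Table 2 can be replayed with a short simulation. This is a sketch: `global_selection` is an illustrative name, the column bookkeeping (nb-block, nb-column) is left out, and ties are broken in favor of the lowest worker index, which is an assumption.

```python
def global_selection(workers, steps):
    """Simulate the first `steps` decisions of the global selection rule.

    workers: list of (c_i, w_i, mu_i); returns 1-based worker indices.
    """
    completion_time = 0      # end of the last communication
    total_work = 0           # block updates assigned so far
    ready = [0] * len(workers)
    chosen = []
    for _ in range(steps):
        def ratio(i):
            c, w, mu = workers[i]
            end = max(completion_time + 2 * mu * c, ready[i])
            return (total_work + mu * mu) / end
        nxt = max(range(len(workers)), key=ratio)
        c, w, mu = workers[nxt]
        total_work += mu * mu
        completion_time = max(completion_time + 2 * mu * c, ready[nxt])
        ready[nxt] = completion_time + mu * mu * w
        chosen.append(nxt + 1)
    return chosen

table2 = [(2, 2, 6), (3, 3, 18), (5, 1, 10)]   # (c_i, w_i, mu_i) for P1, P2, P3
print(global_selection(table2, 4))
```

The simulation selects $P_2$, then $P_1$, then $P_3$, in agreement with the worked example, and then returns to $P_1$.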

At the third step the algorithm selects $P_3$. Going forward, we have a cyclic pattern
repeating, with 13 consecutive communications, one to $P_2$ followed by 12 alternating
between $P_1$ and $P_3$, and then some idle time before the next pattern (see Figure 7). The
asymptotic value of ratio is 1.17 while the steady-state approach of Section 6.1 would achieve
a ratio of 1.39 without memory limitations. Finally, we point out that it is easy to further
refine the algorithm to get closer to the performance of the steady-state. For instance,
instead of selecting the best processor greedily, we could look two steps ahead and search
for the best pair of workers to select for the next two communications (the only price to pay
is an increase in the cost of the selection algorithm). On the example, the two-step-ahead
strategy achieves a ratio of 1.30.



Figure 7: Global selection algorithm on the example of Table 2.



6.2.2 Local selection algorithm 

The global selection algorithm picks, as the next processor, the one that maximizes the ratio
of the total amount of work assigned over the time needed to send all the required data.
Instead, the local selection algorithm chooses, as destination of the $i$-th communication, the
processor that maximizes the ratio of the amount of work assigned by this communication
over the time during which the communication link is used to perform this communication
(i.e., the elapsed time between the end of the $(i-1)$-th communication and the end of the $i$-th
communication). As previously, if processor $P_j$ is the target of the $i$-th communication, the
$i$-th communication is the sending of $\mu_j$ blocks of A and $\mu_j$ blocks of B to processor $P_j$,
which enables it to perform $\mu_j^2$ updates.

More formally, the local selection algorithm picks the worker $P_i$ that maximizes:

$$\frac{\mu_i^2}{\max\{2\mu_i c_i,\ \text{ready}_i - \text{completion-time}\}}$$
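The local rule can be simulated in the same way as the global one. In this sketch, `local_selection` is an illustrative name, and the updates of the completion time and of the ready times are assumed to be the same as in Algorithm 3.

```python
def local_selection(workers, steps):
    """Simulate the first `steps` decisions of the local selection rule.

    workers: list of (c_i, w_i, mu_i); returns 1-based worker indices.
    """
    completion_time = 0
    ready = [0] * len(workers)
    chosen = []
    for _ in range(steps):
        def ratio(i):
            c, w, mu = workers[i]
            # Time during which the link is tied up by this communication.
            link_time = max(2 * mu * c, ready[i] - completion_time)
            return mu * mu / link_time
        nxt = max(range(len(workers)), key=ratio)
        c, w, mu = workers[nxt]
        completion_time = max(completion_time + 2 * mu * c, ready[nxt])
        ready[nxt] = completion_time + mu * mu * w
        chosen.append(nxt + 1)
    return chosen

table2 = [(2, 2, 6), (3, 3, 18), (5, 1, 10)]   # (c_i, w_i, mu_i) for P1, P2, P3
print(local_selection(table2, 4))
```

On the platform of Table 2 the first decisions are $P_2$, $P_1$, $P_3$, $P_1$, the same as those of the global rule.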



Once again we consider the example described in Table 2. For the first three steps,
the global and local selection algorithms make the same decision. In fact, they take the same
first 13 decisions. However, for the 14-th selection, the global algorithm picks processor $P_2$
whereas the local selection picks processor $P_1$, and then processor $P_2$ for the 15-th decision,
as illustrated in Figure 8. Under both selection processes, the second chunk of work is sent
to processor $P_2$ at the same time, but the local algorithm inserts an extra communication.
For this example, the local selection algorithm achieves an asymptotic ratio of computation
per communication of 1.21. This is better than what is achieved by the global selection
algorithm but, obviously, there are examples where the global selection will beat the local
one.



Figure 8: Local selection algorithm on the example of Table 2.



7 Extension to LU factorization 

In this section, we show how our techniques can be extended to LU factorization. We first
consider (Section 7.1) the case of a single worker, in order to study how we can minimize the
communication volume. Then we present algorithms for homogeneous clusters (Section 7.2)
and for heterogeneous platforms (Section 7.3).

We consider the right-looking version of the LU factorization as it is more amenable to
parallelism. As previously, we use a block-oriented approach. The atomic elements that we
manipulate are not matrix coefficients but instead square blocks of size $q \times q$ (hence with $q^2$
coefficients). The size of the matrix is then $r \times r$ blocks. Furthermore, we consider a second
level of blocking of size $\mu$. As previously, $\mu$ is the largest integer such that $\mu^2 + 4\mu \le m$. The
main kernel is then a rank-$\mu$ update $C \leftarrow C + A \cdot B$ of blocks, hence the similarity between
matrix multiplication and LU decomposition.

7.1 Single processor case

The different steps of LU factorization are presented in Figure 9. Step $k$ of the factorization
consists of the following:

1. Factor pivot matrix (Figure 9(a)). We compute at each step a pivot matrix of size $\mu^2$
(which thus contains $\mu^2 \times q^2$ coefficients). This factorization has a communication cost
of $2\mu^2 c$ (to bring the matrix and send it back after the update) and a computation
cost of $\mu^3 w$.

2. Update the $\mu$ columns below the pivot matrix (vertical panel) (Figure 9(b)). Each row
$x$ of this vertical panel is of size $\mu$ and must be replaced by $xU^{-1}$ for a computation
cost of $\frac{1}{2}\mu^2 w$.

The most communication-efficient policy to implement this update is to keep the pivot
matrix in place and to move around the rows of the vertical panel. Each row must be
brought and sent back after update, for a total communication cost of $2\mu c$.

At the $k$-th step, this update has then an overall communication cost of $2\mu(r - k\mu)c$
and an overall computation cost of $\frac{1}{2}\mu^2(r - k\mu)w$.

3. Update the $\mu$ rows at the right of the pivot matrix (horizontal panel) (Figure 9(c)).
Each column $y$ of this horizontal panel is of size $\mu$ and must be replaced by $L^{-1}y$ for
a computation cost of $\frac{1}{2}\mu^2 w$.

This case is symmetrical to the previous one. Therefore, we follow the same policy
and at the $k$-th step, this update has an overall communication cost of $2\mu(r - k\mu)c$
and an overall computation cost of $\frac{1}{2}\mu^2(r - k\mu)w$.

4. Update the core matrix (square matrix of the last $(r - k\mu)$ rows and columns)
(Figure 9(d)). This is a rank-$\mu$ update. Contrary to matrix multiplication, the most
communication-efficient policy is to not keep the result matrix in memory, but either
a $\mu \times \mu$ square block of the vertical panel or of the horizontal panel (both solutions are
symmetrical). Arbitrarily, we then decide to keep in memory a chunk of the horizontal
panel. Then to update a row vector $x$ of the core matrix, we need to bring to that
vector the corresponding row of the vertical panel, and then to send back the updated
value of $x$. This has a communication cost of $3\mu c$ and a computation cost of $\mu^2 w$.

At the $k$-th step, this update for $\mu$ columns of the core matrix has an overall
communication cost of $(\mu^2 + 3(r - k\mu)\mu)c$ (counting the communications necessary to initially
bring the $\mu^2$ elements of the horizontal panel) and an overall computation cost of
$(r - k\mu)\mu^2 w$.

Therefore, at the $k$-th step, this update has an overall communication cost of
$\left(\frac{r}{\mu} - k\right)(\mu^2 + 3(r - k\mu)\mu)c$ and an overall computation cost of $\left(\frac{r}{\mu} - k\right)(r - k\mu)\mu^2 w$.

Figure 9: Scheme for LU factorization at step $k$. (a) The pivot matrix is factored.
(b) Update of vertical panel. A row $x$ is replaced by $xU^{-1}$. (c) Update of horizontal panel.
A column $y$ is replaced by $L^{-1}y$. (d) $\mu$ columns of the core matrix are updated using the
vertical panel and $\mu$ columns of the horizontal panel.



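Using the above scheme, the overall communication cost of the LU factorization is

$$\sum_{k=1}^{r/\mu} \left( 2\mu^2 + 4\mu(r - k\mu) + \left(\frac{r}{\mu} - k\right)\left(\mu^2 + 3(r - k\mu)\mu\right) \right) c = \left( \frac{r^3}{\mu} + r^2 \right) c,$$

while the overall computation cost is

$$\sum_{k=1}^{r/\mu} \left( \mu^3 + \mu^2(r - k\mu) + \left(\frac{r}{\mu} - k\right)(r - k\mu)\mu^2 \right) w = \frac{1}{3}\left( r^3 + 2\mu^2 r \right) w.$$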
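Summing the per-step costs of the four updates listed in Section 7.1 gives the totals in closed form: $(r^3/\mu + r^2)c$ for communications and $\frac{1}{3}(r^3 + 2\mu^2 r)w$ for computations. The sketch below checks this numerically; `lu_costs` is an illustrative name, and the costs are counted in units of $c$ and $w$.

```python
def lu_costs(r, mu):
    """Sum the per-step communication and computation costs of Section 7.1."""
    assert r % mu == 0
    comm = comp = 0
    for k in range(1, r // mu + 1):
        rest = r - k * mu             # rows (and columns) beyond the pivot matrix
        chunks = r // mu - k          # groups of mu columns in the core matrix
        # pivot + both panels + core update, communication side
        comm += 2 * mu**2 + 4 * mu * rest + chunks * (mu**2 + 3 * rest * mu)
        # pivot + both panels (2 * 1/2) + core update, computation side
        comp += mu**3 + mu**2 * rest + chunks * rest * mu**2
    return comm, comp

r, mu = 12, 3
comm, comp = lu_costs(r, mu)
print(comm, comp)
```

Both totals are dominated by the update of the core matrix, which is why Section 7.2 parallelizes that step.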



7.2 Algorithm for homogeneous clusters 



The most time-consuming part of the factorization is the update of the core matrix (it has
an overall cost of $\left(\frac{1}{3}r^3 - \frac{1}{2}\mu r^2 + \frac{1}{6}\mu^2 r\right)w$). Therefore, we want to parallelize this update by
allocating blocks of $\mu$ columns of the core matrix to different processors. Just as for matrix
multiplication, we would like to determine the optimal number of participating workers $\mathfrak{P}$.
For that purpose, we proceed as previously. On the communication side, we know that in a
round (each worker updating $\mu$ columns entirely), the master sends to each worker $\mu^2$ blocks
of the horizontal panel, then sends to each worker the $\mu(r - k\mu)$ blocks of the vertical panel,
and exchanges with each of them $2\mu(r - k\mu)$ blocks of the core matrix ($\mu(r - k\mu)$ received and
later sent back after update). Also during this round, on the computation side, each worker
computes $\mu^2(r - k\mu)$ block updates. If we enroll too many processors, the communication
capacity of the master will be exceeded. There is a limit on the number of blocks sent per
time unit, hence on the maximal processor number $\mathfrak{P}$, which we compute as follows: $\mathfrak{P}$ is
the smallest integer such that

$$\left(\mu^2 + 3\mu(r - k\mu)\right) c \times \mathfrak{P} \ge \mu^2(r - k\mu) w.$$

We obtain that

$$\mathfrak{P} = \left\lceil \frac{\mu w}{3c} \right\rceil$$

while neglecting the term $\mu^2$ in the communication cost, as we assume $\frac{r}{\mu}$ to be large.

Once the resource selection is performed, we propose a straightforward algorithm: a
single processor is responsible for the factorization of the pivot matrix and for the update of
the vertical and horizontal panels, and then $\mathfrak{P}$ processors work in parallel on the update of
the core matrix.
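The effect of neglecting the $\mu^2$ term can be checked numerically. This sketch assumes that, in a round at step $k$, each worker receives $\mu^2 + 3\mu(r - k\mu)$ blocks and performs $\mu^2(r - k\mu)$ block updates, as described in this section; `lu_workers` and the parameter `rest` (standing for $r - k\mu$) are illustrative names.

```python
from math import ceil

def lu_workers(mu, rest, c, w):
    """Return (exact, approximate) worker counts for the core update.

    rest stands for r - k*mu, the number of rows still to update.
    """
    exact = ceil(mu**2 * rest * w / ((mu**2 + 3 * mu * rest) * c))
    approx = ceil(mu * w / (3 * c))       # mu^2 term neglected
    return exact, approx

print(lu_workers(mu=4, rest=400, c=2, w=4.5))
```

When $r - k\mu$ is large compared with $\mu$, both counts coincide; near the end of the factorization the exact count can be smaller.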



7.3 Algorithm for heterogeneous platforms 

In this section, we simply sketch the algorithm for heterogeneous platforms. When
targeting heterogeneous platforms, there is a big difference between LU factorization and matrix
multiplication. Indeed, for LU once the size $\mu$ of the pivot matrix is fixed, all processors
have to deal with it, whatever their memory capacities. There was no such fixed common
constant for matrix multiplication. Therefore, a crucial step for heterogeneous platforms is
to determine the size $\mu$ of the pivot matrix. Note that two pivot matrices at two different
steps of the factorization may have different sizes; the constraint is that all workers must
use the same size at any given step of the elimination.

In theory, the memory size of the workers can be arbitrary. In practice, however, memory
size usually is an integral number of Gigabytes, and at most a few tens of Gigabytes. So it
is feasible to exhaustively study all the possible values of $\mu$, estimate the processing time for
each value, and then pick the best one. Therefore, in the following we assume the value of
$\mu$ has been chosen, i.e., the pivot matrix is of a known size $\mu \times \mu$.

The memory layout used by each worker $P_i$ follows the same policy as in the
homogeneous case:

- a chunk of the horizontal panel is kept in memory,

- rows of the vertical panel are sent to $P_i$,

- and rows of the core matrix are sent to $P_i$ and are returned to the master after update.

If $\mu_i = \mu$, processor $P_i$ operates exactly as for the homogeneous case. But if the memory
capacity of $P_i$ does not perfectly correspond to the size chosen for the pivot matrix, we still
have to decide the shape of the chunk of the horizontal panel that processor $P_i$ is going to
keep in its memory. We have two cases to consider:

1. $\mu_i < \mu$. In other words, $P_i$ does not have enough memory. Then we can imagine two
different shapes for the horizontal panel chunk:

(a) Square chunk, i.e., the chunk is of size $\mu_i \times \mu_i$. Then, for each update the master
must send to $P_i$ a row of size $\mu_i$ of the vertical panel and a row of size $\mu_i$
of the core matrix, and $P_i$ sends back after update the row of the core matrix.
Hence a communication cost of $3\mu_i c$ for $\mu_i^2$ computations. The
computation-to-communication ratio induced by this chunk shape is then:

$$\frac{\mu_i^2 w}{3\mu_i c} = \frac{\mu_i w}{3c}.$$

(b) Set of whole columns of the horizontal panel, i.e., the chunk is of size $\mu \times \frac{\mu_i^2}{\mu}$.
Then, for each update the master must send to $P_i$ a row of size $\mu$ of the vertical
panel and a row of size $\frac{\mu_i^2}{\mu}$ of the core matrix, and $P_i$ sends back after update
the row of the core matrix. Hence a communication cost of $\left(\mu + 2\frac{\mu_i^2}{\mu}\right)c$ for $\mu_i^2$
computations. The computation-to-communication ratio induced by this chunk
shape is then:

$$\frac{\mu_i^2 w}{\left(\mu + 2\frac{\mu_i^2}{\mu}\right)c}.$$

The choice of the policy depends on the ratio $\frac{\mu_i}{\mu}$. Indeed,

$$\frac{\mu_i^2 w}{3\mu_i c} \ge \frac{\mu_i^2 w}{\left(\mu + 2\frac{\mu_i^2}{\mu}\right)c} \iff \mu + 2\frac{\mu_i^2}{\mu} \ge 3\mu_i \iff \left(2\frac{\mu_i}{\mu} - 1\right)\left(\frac{\mu_i}{\mu} - 1\right) \ge 0.$$

Therefore, since $\mu_i < \mu$, the square chunk approach is more efficient if and only if $\mu_i \le \frac{1}{2}\mu$.

2. $\mu_i > \mu$. In other words, $P_i$ has more memory than necessary to hold a square matrix
like the pivot matrix, that is a matrix of size $\mu \times \mu$. In that case, we propose to divide
the memory of $P_i$ into $\left\lfloor \frac{\mu_i^2}{\mu^2} \right\rfloor$ square chunks of size $\mu$, and to use this processor as if
there were in fact $\left\lfloor \frac{\mu_i^2}{\mu^2} \right\rfloor$ processors with a memory of size $\mu^2$.
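The comparison between the two chunk shapes in case 1 can be checked with a small sketch. It assumes the costs stated in this section: $3\mu_i c$ per $\mu_i^2$ updates for a square chunk, and $(\mu + 2\mu_i^2/\mu)c$ per $\mu_i^2$ updates for a chunk spanning the whole height of the horizontal panel. All function names are illustrative.

```python
def square_chunk_ratio(mu_i, c, w):
    """Computation-to-communication ratio of a mu_i x mu_i chunk."""
    return mu_i**2 * w / (3 * mu_i * c)

def column_chunk_ratio(mu_i, mu, c, w):
    """Ratio of a chunk of size mu x (mu_i^2 / mu)."""
    return mu_i**2 * w / ((mu + 2 * mu_i**2 / mu) * c)

def best_shape(mu_i, mu, c, w):
    """Pick the shape with the larger ratio (ties go to the square chunk)."""
    sq = square_chunk_ratio(mu_i, c, w)
    col = column_chunk_ratio(mu_i, mu, c, w)
    return "square" if sq >= col else "columns"

for mu_i in (2, 4, 6, 8):
    print(mu_i, best_shape(mu_i, mu=10, c=2, w=4.5))
```

With $\mu = 10$, the square chunk wins for $\mu_i = 2$ and $\mu_i = 4$ and loses for $\mu_i = 6$ and $\mu_i = 8$, in line with the threshold $\mu_i \le \frac{1}{2}\mu$.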



So far, we have assumed we knew the value of $\mu$ and we have proposed a memory layout
for the workers. We still have to decide which processors to enroll in the computation. We
perform the resource selection as for matrix multiplication: we decide to assign only full
matrix column blocks of the core matrix and of the horizontal panel to workers, and we
actually perform resource selection using the same selection algorithms as for matrix
multiplication.

The overall process to define a solution is then:

1. For each possible value of $\mu$ do

(a) Find the processor which will be the fastest to factor the pivot matrix, and to 
update the horizontal and vertical panels. 

(b) Perform resource selection and then estimate the running time of the update of 
the core-matrix. 

2. Retain the solution leading to the best (estimated) overall running time.

8 MPI experiments

In this section, we aim at validating the previous theoretical results and algorithms. We
conduct a variety of MPI experiments to compare our new schemes with several other
algorithms from the literature. In the final version of this paper, we will report results obtained
for heterogeneous platforms, assessing the impact of the degree of heterogeneity (in processor
speed, link bandwidth and memory capacity) on the performance of the various algorithms.
For this current version, we restrict ourselves to homogeneous platforms. Even in this simpler
framework, sophisticated memory management turns out to be very important.

We start with a description of the platform, and of all the different algorithms that 
we compare. Then we describe the experiments that we have conducted and justify their 
purpose. Finally, we discuss the results. 

8.1 Platform 

For our experiments we are using a platform at the University of Tennessee. All experiments
are performed on a cluster of 64 Xeon 3.2 GHz dual-processor nodes. Each node of the
cluster has four Gigabytes of memory and runs the Linux operating system. The nodes are
connected with a switched 100 Mbps Fast Ethernet network. In order to build a master-worker
platform, we arbitrarily choose one processor as the master, and the other processors
become the workers. Finally, we used MPI_Wtime as the timer in all experiments.

8.2 Algorithms 

We choose six different algorithms from the general literature to compare our algorithm to.
We partition these algorithms into two sets. The first set is composed of algorithms which
use the same memory allocation as ours. The only difference between the algorithms is
the order in which the master sends blocks to workers.

Homogeneous algorithm (HoLM) is our homogeneous algorithm. It performs resource
selection, and sends blocks to the selected workers in a round-robin fashion.

Overlapped Round-Robin, Optimized Memory Layout (ORROML) is very similar
to our homogeneous algorithm. The only difference between them is that it does
not perform any resource selection, and so sends tasks to all available workers in a
round-robin fashion.

Overlapped Min-Min, Optimized Memory Layout (OMMOML) is a static scheduling
heuristic, which sends the next block to the first worker that will be available to
compute it. As it is looking for potential workers in a given order, this algorithm
performs some resource selection too. Theoretically, as our homogeneous resource
selection ensures that the first worker is free to compute when we finish sending blocks
to the others, they should have similar behavior.

Overlapped Demand-Driven, Optimized Memory Layout (ODDOML) is a demand-driven
algorithm. In order to use the extra buffers available in the worker memories,
it will send the next block to the first worker which can receive it. This would be a
dynamic version of our algorithm, if it took worker selection into account.

Demand-Driven, Optimized Memory Layout (DDOML) is a very simple dynamic
demand-driven algorithm, close to ODDOML. It sends the next block to the first
worker which is free for computation. As workers never have to receive and compute
at the same time, the algorithm has no extra buffer, so the memory available to store
A, B, and C is greater. This may change the value of $\mu$ and so the behavior of the
algorithm.

In the second set we have algorithms which do not use our memory allocation: 

Block Matrix Multiply (BMM) is Toledo's algorithm [38]. It splits each worker memory
equally into three parts, and allocates one slot for a square block of A, another for a
square block of B, and the last one for a square block of C, each square block having
the same size. Then it sends blocks to the workers in a demand-driven fashion, when
a worker is free for computation. First a worker receives a block of C, then it receives
corresponding blocks of A and B in order to update C, until C is fully computed. In
this version, a worker does not overlap computation with the receiving of the next blocks.

Overlapped Block Matrix Multiply (OBMM) is our attempt to improve the previous
algorithm. We try to overlap the communications and the computations of the workers.
To that purpose, we split each worker memory into five parts, so as to receive one block
of A and one block of B while previous ones are used to update C.

8.3 Experiments 

We have built several experimental protocols in order to assess the performance of the 
various algorithms. In the following experiments we use nine processors, one master and 
eight workers. In all experiments we compare the execution time needed by the algorithms 
which use our memory allocation to the execution time of the other algorithms. We also 
point out the number of processors used by each algorithm, which is an important parameter 
when comparing execution times. 

In the first set of experiments, we test the different algorithms on matrices of different 
sizes and shapes. The matrices we are multiplying are of actual size 

- 8000 X 8000 for A and 8000 x 64000 for B, 

- 16000 X 16000 for A and 16000 x 128000 for B, and 

- 8000 X 64000 for A and 64000 x 64000 for B. 

All the algorithms using our optimized memory layout consider these matrices as composed
of square blocks of size $q \times q = 80 \times 80$. For instance in the first case we have $r = t = 100$
and $s = 800$.
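The block dimensions of the three test cases follow directly from the matrix sizes and from $q$. A small sketch, where `block_dims` is an illustrative helper:

```python
def block_dims(a_shape, b_shape, q):
    """Return (r, t, s): A is r x t blocks, B is t x s blocks, C is r x s blocks."""
    (rows_a, cols_a), (rows_b, cols_b) = a_shape, b_shape
    assert cols_a == rows_b, "inner dimensions must agree"
    assert rows_a % q == 0 and cols_a % q == 0 and cols_b % q == 0
    return rows_a // q, cols_a // q, cols_b // q

cases = [
    ((8000, 8000), (8000, 64000)),
    ((16000, 16000), (16000, 128000)),
    ((8000, 64000), (64000, 64000)),
]
for a_shape, b_shape in cases:
    print(block_dims(a_shape, b_shape, q=80))
```

The first case gives $r = t = 100$ and $s = 800$, as stated above.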

In the second set of experiments we check whether the choice of $q$ was wise. For that
purpose, we launch the algorithms on matrices of size 8000 x 8000 and 8000 x 64000, changing
from one experiment to another the size of the elementary square blocks. Then $q$ will be
respectively equal to 40 and 80. As the global matrix size is the same in both experiments,
we expect both results to be the same.

In the third set of experiments we investigate the impact of the worker memory size
on the performance of the algorithms. In order to have reasonable execution times, we
use matrices of size 16000 x 16000 and 16000 x 64000, and the memory size will vary from
132MB to 512MB. We choose these values to reduce side effects due to the partition of the
matrices into blocks of size $\mu q \times \mu q$.

In the fourth and last set of experiments we check the stability of the previous results. To 
that purpose we launch the same execution five times, in order to determine the maximum 
gap between two runs. 



8.4 Results and discussion 

We see in Figure 10 the results of the first set of experiments, where algorithms are computing 
different matrices. The first remark is that the shape of the three experiments is the same for 
all matrix sizes. We also underline the superiority of most of the algorithms which use our 
memory allocation against BMM: HoLM, ORROML, ODDOML, and DDOML are 
the best algorithms and have similar performance. Only OMMOML needs more time to 
complete its execution. This delay comes from its resource selection: it uses only two workers. 
For instance, HoLM uses four workers, and is as competitive as the other algorithms which 
all use the eight available workers. 

In Figure 12, we see the impact of $q$ on the performance of our algorithms. BMM and
OBMM have the same execution times in both experiments as these algorithms do not split
matrices into elementary square blocks of size $q \times q$ but, instead, call the Level 3 BLAS
routines directly on the whole $\sqrt{m/3} \times \sqrt{m/3}$ matrices. In the two cases we see that the
execution times of the algorithms are similar. We point out that this experiment shows that the
choice of $q$ has little impact on the performance of the algorithms.

Figure 13 shows the impact of the worker memory size on the performance of the
algorithms. As expected, the performance increases with the amount of memory available.
It is interesting to underline that our resource selection always performs in the best possible
way. HoLM uses two and then four workers as the available memory increases,
whereas the other algorithms use all eight available workers in each test.
OMMOML also performs some resource selection, but it performs worse.

Finally, Figure 11 shows the difference that we can have between two runs. This difference 
is around 6%. Thus if two algorithms have less than 6% of difference in execution time, they 
should be considered as similar. 
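
The 6% rule of thumb above can be written as a small helper. The following Python sketch is illustrative only: the function name `similar` and the choice of measuring the gap relative to the faster of the two runs are assumptions, not details taken from the experiments.

```python
def similar(time_a: float, time_b: float, tolerance: float = 0.06) -> bool:
    """Return True when two execution times differ by less than `tolerance`.

    Assumption: the gap is measured relative to the faster of the two runs.
    """
    if time_a <= 0 or time_b <= 0:
        raise ValueError("execution times must be positive")
    fastest = min(time_a, time_b)
    return abs(time_a - time_b) / fastest < tolerance
```

With this convention, runs of 100 s and 104 s would be treated as equivalent, while runs of 100 s and 110 s would not.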

To conclude, these experiments stress the superiority of our memory allocation. Further- 
more, our homogeneous algorithm is as competitive as the others but uses fewer resources. 

Figure 10: Performance of the algorithms on different matrices.

9 Related work

In this section, we provide a brief overview of related papers, which we classify along the 
following five main lines: 

Load balancing on heterogeneous platforms — Load balancing strategies for hetero- 
geneous platforms have been widely studied. Distributing the computations (together 
with the associated data) can be performed either dynamically or statically, or a mix- 
ture of both. Some simple schedulers are available, but they use naive mapping strate- 
gies such as master-worker techniques or paradigms based upon the idea "use the past 
to predict the future", i.e. use the currently observed speed of computation of each ma- 
chine to decide for the next distribution of work [17, 18, 9]. Dynamic strategies such 
as self-guided scheduling [34] could be useful too. There is a challenge in determining
a trade-off between the data distribution parameters and the process spawning and 
possible migration policies. Redundant computations might also be necessary to use a 
heterogeneous cluster at its best capabilities. However, dynamic strategies are outside 
the scope of this paper (but mentioned here for the sake of completeness). Because we
have a library designer's perspective, we concentrate on static allocation schemes that 
are less general and more difficult to design than dynamic approaches, but which are 
better suited for the implementation of fixed algorithms such as linear algebra kernels 
from the ScaLAPACK library [13]. 
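
As an illustration of the "use the past to predict the future" idea quoted above, the following Python sketch splits the next batch of work in proportion to the speed currently observed on each machine. It is a minimal sketch under stated assumptions: the function name `distribute`, the integer task granularity, and the largest-remainder rounding are hypothetical choices and are not taken from [17, 18, 9].

```python
def distribute(n_tasks: int, observed_speeds: list[float]) -> list[int]:
    """Split n_tasks among workers in proportion to their observed speeds."""
    total = sum(observed_speeds)
    if total <= 0:
        raise ValueError("at least one worker must have a positive speed")
    exact = [n_tasks * s / total for s in observed_speeds]
    counts = [int(x) for x in exact]  # round each share down
    leftover = n_tasks - sum(counts)
    # Give the remaining tasks to the largest fractional parts
    # (ties are broken in favor of the lowest worker index).
    order = sorted(range(len(exact)), key=lambda i: (counts[i] - exact[i], i))
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```

A dynamic scheduler would call such a routine again after each round with updated speed measurements; the static schemes studied in this paper fix the allocation once, before execution.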

Out-of-core linear algebra routines — As already mentioned, the design of parallel al- 
gorithms for limited memory processors is very similar to the design of out-of-core 
routines for classical parallel machines. On the theoretical side, Hong and Kung [26]
investigate the I/O complexity of several computational kernels in their pioneering 
paper. Toledo [38] proposes a nice survey on the design of out-of-core algorithms for 
linear algebra, including dense and sparse computations. We refer to [38] for a com- 
plete list of implementations. The design principles followed by most implementations 
are introduced and analyzed by Dongarra et al. [22].

Linear algebra algorithms on heterogeneous clusters — Several authors have dealt 
with the static implementation of matrix-multiplication algorithms on heterogeneous 
platforms. One simple approach is given by Kalinov and Lastovetsky [29]. Their idea is 
to achieve a perfect load-balance as follows: first they take a fixed layout of processors 
arranged as a collection of processor columns; then the load is evenly balanced within 
each processor column independently; next the load is balanced between columns; this 
is the "heterogeneous block cyclic distribution" of [29]. Another approach is proposed
by Crandall and Quinn [20], who propose a recursive partitioning algorithm, and by 
Kaddoura, Ranka and Wang [28], who refine the latter algorithm and provide several 
variations. They report several numerical simulations. As pointed out in the introduc- 
tion, theoretical results for matrix multiplication and LU decomposition on 2D-grids of 
heterogeneous processors are reported in [5], while extensions to general 2D
partitioning are considered in [6]. See also Lastovetsky and Reddy [31] for another partitioning
approach. 
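
The column-based balancing described above for [29] can be sketched in a few lines of Python. This is a simplified illustration, not the actual distribution of [29]: the function name `column_based_shares`, the use of relative speeds as input, and the continuous (non-cyclic) shares are all assumptions made for the example.

```python
from fractions import Fraction

def column_based_shares(speeds):
    """speeds[i][j] is the relative speed of the processor in row i, column j.

    Returns (widths, heights): widths[j] is the fraction of matrix columns
    given to processor column j, and heights[j][i] is the fraction of rows
    given to processor (i, j) inside that column.
    """
    n_rows, n_cols = len(speeds), len(speeds[0])
    column_speed = [sum(Fraction(speeds[i][j]) for i in range(n_rows))
                    for j in range(n_cols)]
    total = sum(column_speed)
    # Balance between columns: width proportional to aggregate column speed.
    widths = [c / total for c in column_speed]
    # Balance within each column independently: height proportional to speed.
    heights = [[Fraction(speeds[i][j]) / column_speed[j] for i in range(n_rows)]
               for j in range(n_cols)]
    return widths, heights
```

With these shares, the area assigned to processor (i, j), namely widths[j] * heights[j][i], is proportional to its speed, which is the perfect load balance the approach aims for.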

Recent papers aim at easing the process of tuning linear algebra kernels on
heterogeneous systems. Self-optimization methodologies are described by Cuenca et
al. [21] and by Chen et al. [16]. Along the same line, Chakravarti et al. [15] describe an
implementation of Cannon's algorithm using self-organizing agents on a peer-to-peer 
network. 

Models for heterogeneous platforms — In the literature, one-port models come in two 
variants. In the unidirectional variant, a processor cannot be involved in more than 
one communication at a given time-step, either a send or a receive. This is the model 
that we have used throughout the paper. In the bidirectional model, a processor can 
send and receive in parallel, but at most to a given neighbor in each direction. In both
variants, if P_u sends a message to P_v, both P_u and P_v are blocked throughout the
communication.
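
To make the consequence of the one-port assumption concrete, the following Python sketch computes when each worker finishes if the master sends one message per worker, one after the other. It is a minimal sketch under stated assumptions: the function name `one_port_makespan`, the single message per worker, and the absence of result messages are simplifications introduced for illustration.

```python
def one_port_makespan(comm_times, comp_times):
    """Finish times when the master sends one message per worker, in list order.

    Under the one-port assumption the master's sends are serialized, so the
    worker at position i starts computing only after messages 0..i are sent.
    """
    clock = 0.0  # time at which the master's port becomes free
    finish = []
    for comm, comp in zip(comm_times, comp_times):
        clock += comm                # the port is busy for the whole transfer
        finish.append(clock + comp)  # the worker computes after reception
    return max(finish, default=0.0), finish
```

Because the sends are serialized, the order in which workers are served changes the overall completion time, which is why communication ordering matters on such platforms.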

The bidirectional one-port model is used by Bhat et al. [10, 11] for fixed-size mes- 
sages. They advocate its use because "current hardware and software do not easily 
enable multiple messages to be transmitted simultaneously." Even if non-blocking, 
multi-threaded communication libraries allow for initiating multiple send and receive 
operations, they claim that all these operations "are eventually serialized by the single 
hardware port to the network." Experimental evidence of this fact has recently been 
reported by Saif and Parashar [35], who report that asynchronous MPI sends get se- 
rialized as soon as message sizes exceed a few megabytes. Their results hold for two 
popular MPI implementations, MPICH on Linux clusters and IBM MPI on the SP2. 

The one-port model fully accounts for the heterogeneity of the platform, as each link 
has a different bandwidth. It generalizes a simpler model studied by Banikazemi et 
al. [1], Liu [32], and Khuller and Kim [30]. In this simpler model, the communication
time only depends on the sender, not on the receiver. In other words, the communi- 
cation speed from a processor to all its neighbors is the same. 

Finally, we note that some papers [2, 4] depart from the one-port model as they allow
a sending processor to initiate another communication while a previous one is still 
on-going on the network. However, such models insist that there is an overhead time 
to pay before being engaged in another operation, so they do not allow fully
simultaneous communications.

Master-worker on the computational grid — Master-worker scheduling on the grid can
be based on a network-flow approach [37, 36] or on an adaptive strategy [24]. Note 
that the network-flow approach of [37, 36] is possible only when using a full multiple- 
port model, where the number of simultaneous communications for a given node is 
not bounded. This approach has also been studied in [25]. Enabling frameworks to 
facilitate the implementation of master-worker tasking are described in [23, 39].

10 Conclusion 

The main contributions of this paper are the following: 

1. On the theoretical side, we have derived a new, tighter, bound on the minimal volume 
of communications needed to multiply two matrices. From this lower bound, we have 
defined an efficient memory layout, i.e., an algorithm to share the memory available
on the workers among the three matrices. 

2. On the practical side, starting from our memory layout, we have designed an algorithm 
for homogeneous platforms whose performance is quite close to the communication 
volume lower bound. We have extended this algorithm to deal with heterogeneous 
platforms, and discussed how to adapt the approach for LU factorization. 

3. Through MPI experiments, we have shown that our algorithm for homogeneous
platforms has far better performance than solutions using the memory layout proposed
in [38]. Furthermore, this static homogeneous algorithm has similar performance to
dynamic algorithms using the same memory layout, but uses fewer processors. It is 
therefore a very good candidate for deploying applications on regular, homogeneous 
platforms. 

We are currently conducting experiments to assess the performance of the extension of 
the algorithm for heterogeneous clusters. 